Fast Search Methods for Biological Sequence Databases
Abstract
Biology researchers have a pressing need for data management technologies that make the storage and retrieval of DNA and protein sequence data accurate and efficient. The volume of data generated by DNA sequencing is already massive and will continue to grow rapidly. Even if current sequence databases are adequate today, they will almost certainly become inadequate once far more sequence data has been determined. Future research in sequence databases therefore needs to focus on the organization of information, so that the volume of data that must be searched does not grow linearly with the volume of sequence data being discovered. We propose to develop an index structure and retrieval system for biological sequence databases, called PROXIMAL, which promises to be efficient and general. This organization of the databases will complement other current efforts at sequence comparison and analysis by providing an infrastructure in which other methods can be used to locate desired sequences efficiently. Our method relies on the use of reference strings to partition the database of sequences. It is efficient because using multiple reference strings for any given distance measure greatly reduces the number of sequences that must be examined, allowing us to locate sequences quickly based on a precomputed metric. It is general because multiple distance measures can be used. These include, at a minimum, differing gap and mismatch weights for the basic edit distance calculation, or entirely different models of mutation. The only requirement is that there is a metric structure, mainly that the calculations satisfy the triangle inequality. This is a weak requirement that is satisfied by many interesting measures, including those currently in wide use for sequence comparison.
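The pruning idea in the abstract can be illustrated with a small sketch. This is not the PROXIMAL implementation, whose details the abstract does not give; it is a minimal example of how precomputed distances to reference strings, combined with the triangle inequality, might let a range query skip most sequences. The names `edit_distance`, `ReferenceIndex`, and `range_search` are hypothetical, and unit-cost edit distance is assumed as the metric. For any reference string r, the triangle inequality gives |d(q, r) - d(s, r)| <= d(q, s), so a sequence s can be discarded without comparison whenever that lower bound exceeds the search radius.

```python
from typing import Callable, List, Sequence, Tuple


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance (assumed metric for this sketch)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # match or substitute
        prev = curr
    return prev[-1]


class ReferenceIndex:
    """Hypothetical index: stores each sequence's distance to every reference string."""

    def __init__(self, sequences: Sequence[str], references: Sequence[str],
                 dist: Callable[[str, str], int] = edit_distance) -> None:
        self.sequences = list(sequences)
        self.references = list(references)
        self.dist = dist
        # table[i][k] = precomputed distance from sequence i to reference k
        self.table = [[dist(s, r) for r in self.references]
                      for s in self.sequences]

    def range_search(self, query: str, radius: int) -> Tuple[List[str], int]:
        """Return (sequences within radius of query, number fully compared)."""
        q_to_ref = [self.dist(query, r) for r in self.references]
        hits: List[str] = []
        examined = 0
        for seq, row in zip(self.sequences, self.table):
            # Triangle inequality lower bound on d(query, seq);
            # the best bound over all references is the largest one.
            lower = max((abs(qr - sr) for qr, sr in zip(q_to_ref, row)),
                        default=0)
            if lower > radius:
                continue  # pruned without computing d(query, seq)
            examined += 1
            if self.dist(query, seq) <= radius:
                hits.append(seq)
        return hits, examined


# Toy usage with made-up sequences and reference strings.
database = ["ACGTACGT", "ACGTACGA", "TTTTTTTT", "GGGGCCCC", "ACGT"]
index = ReferenceIndex(database, references=["ACGTACGT", "TTTTTTTT"])
hits, examined = index.range_search("ACGTACGG", radius=1)
print(hits, "full comparisons:", examined, "of", len(database))
```

Because the bound is only a filter, the result is the same as an exhaustive scan; the saving is in how many full distance computations are performed. Any other measure could be passed as `dist`, provided it satisfies the triangle inequality.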
Similar resources
Protein Databases
Proteins are sources of many peptides with diverse biological activity. Some of them are considered as valuable components of foods and drug targets with desired and designed biological activity. We are now entering an era rich in biological data in which the field of bioinformatics is poised to exploit this information in increasingly powerful ways. There are currently many databases all over ...
gpALIGNER: A Fast Algorithm for Global Pairwise Alignment of DNA Sequences
Bioinformatics, through the sequencing of the full genomes for many species, is increasingly relying on efficient global alignment tools exhibiting both high sensitivity and specificity. Many computational algorithms have been applied for solving the sequence alignment problem. Dynamic programming, statistical methods, approximation and heuristic algorithms are the most common methods appli...
Optimal Protein Encoding
A common task in Bioinformatics is the search of sequence databases for matching sequences. In protein sequence databases, searching is hindered by both the increased amount of data and the complexity of sequence similarity metrics. Protein similarity is not simply a matter of character matching, but rather is determined by a matrix of scores assigned to every match and mismatch [5]. One strate...
ODM BLAST: Sequence Homology Search in the RDBMS
Performing sequence homology searches against DNA or protein sequence databases is an essential bioinformatics task. Past research efforts have been primarily concerned with the development of sensitive and fast sequence homology search algorithms outside of the relational database management system (RDBMS). Oracle Data Mining (ODM) BLAST enables BLAST to be performed in an RDBMS. ODM BLAST reli...
Cluster-preserving embedding of proteins by
Similarity searching in protein sequence databases is a standard technique for biologists dealing with a newly sequenced protein. Exhaustive search in such databases is prohibitive because of the large sizes of these databases and because pairwise comparisons are slow. Heuristic techniques, such as FASTA and BLAST, are useful because they are fast and accurate, though it has been shown that exha...
Cluster-preserving embedding of proteins by Gabriela
Similarity searching in protein sequence databases is a standard technique for biologists dealing with a newly sequenced protein. Exhaustive search in such databases is prohibitive because of the large sizes of these databases and because pairwise comparisons are slow. Heuristic techniques, such as FASTA and BLAST, are useful because they are fast and accurate, though it has been shown that exha...